Presentations built around content developed for the Epidemiologist R handbook
An excellent resource for all skill/experience levels
We will direct you towards specific sections to work through in your own time
Two-hour sessions, twice a week, to present key topics and answer questions
For this course, you will need to install 2 items:
R programming language
R Studio
Integrated Development Environment (IDE)
A very helpful resource for writing and running R code
You will need to install them in this order: first R, then R Studio
“Massive Wall of Organized Documents” by Zeusandhera is licensed with CC BY-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/2.0/
Setting up files and folders will make your analysis (and life!) easier
Folder structure
Naming files and folders
R Studio works best when you use its project function
Each project contains all of your inputs, outputs and code
This also makes it easier to share folders with colleagues
Projects are covered in more detail in the Epidemiologist R handbook: Chapter 6 “R Projects”
When you share your code with colleagues, or return to it after several weeks or months, you will be grateful that you gave your files and folders meaningful names!
Many organisations have style guides to ensure that teams can collaborate on coding projects
Key points to remember for naming
Keep the name short
Instead of “data_import_of_file_for_analysis.R”
Avoid spaces!
Instead of “import file.R”
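As a small illustration of these naming points, the base R sketch below flags names that contain spaces and reports their length. The file names are the two examples above plus a hypothetical shorter name ("import_data.R") that is an assumption for illustration, not a name from the course material.

```r
# File names to check; "import_data.R" is a hypothetical example of a short name
file_names <- c("import file.R",
                "data_import_of_file_for_analysis.R",
                "import_data.R")

# TRUE where the name contains a space
has_space <- grepl(" ", file_names)

# Number of characters in each name
name_length <- nchar(file_names)

print(data.frame(file_names, has_space, name_length))

# One possible fix for spaces: replace them with underscores
fixed_names <- gsub(" ", "_", file_names)
print(fixed_names)
```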
What is an R package?
An R package is a collection of functions which you can use to import, clean, analyse and report your data
Link to Epidemiologist R handbook - 3.7 “Packages”
Packages can simplify your workflow by combining multiple steps into a smaller number of commands
Example: readxl is a package of functions used to import data from Excel to R.
install.packages("readxl")
We have asked R to install the package “readxl”.
The installation has been successful. You do not need to reinstall packages every time you start a new project, as they are saved in your library.
In R, red text does not mean there has been an error!
You will now be able to see the package in your list of packages
Now that readxl has been installed, you will be able to load it and use its functions
library(readxl)
When the package has been successfully loaded, you will see a tick mark in the box.
It is good practice to load all packages at the start of a script. This can help you to see which packages are being loaded and it ensures that you can write code without interruptions from the library command.
There is a package called pacman which can help with this process. When you run pacman::p_load you can list all of the packages you want to load. If the package has not previously been installed, pacman will install it. If the package has been installed, pacman will load it.
pacman::p_load(readxl,here)
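To make the "install if missing, then load" idea concrete, here is a rough base R sketch of that logic. The helper name load_or_install is hypothetical and this is not how pacman is implemented internally; it only mirrors the behaviour described above. The example call uses the stats package, which ships with R, so nothing is actually installed when you run it.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # only runs when the package is not yet in the library
  }
  library(pkg, character.only = TRUE)
}

# "stats" is part of every R installation, so this call just loads it
load_or_install("stats")
```

In practice pacman::p_load is more convenient because it accepts several package names at once.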
Each package has multiple functions that you can use on your data.
To read more about a particular package, type
?readxl
Help documentation for readxl
For this example, we want to import data that is currently stored in an Excel formatted file “.xlsx”
So we can use the function read_xlsx from the readxl package
read_xlsx(here('data','AfricaCovid','AfricaCovid.xlsx'))
## New names:
## * `` -> ...1
## # A tibble: 58 x 2
## ...1 `www.hera-ngo.org`
## <chr> <chr>
## 1 <NA> org.hera@gmail.com
## 2 Country Last update
## 3 Algeria 44318
## 4 Angola 44318
## 5 Benin 44317
## 6 Botswana 44317
## 7 Burkina Faso 44317
## 8 Burundi 44318
## 9 Cameroon 44317
## 10 Central African Republic 44317
## # … with 48 more rows
But what does this show? And how can we use it?
So when we tell R to use the function read_xlsx, it reads the first sheet which is called “ReadMore”.
It looks like this is a summary sheet with information about when data for each country was last updated.
So how do we tell R to read in a different sheet from the Excel file?
Question - How many confirmed cases of COVID were recorded across Africa in July 2020?
First step - Import data from the sheet containing information on COVID cases
We can use the excel_sheets function from readxl to get the names of all sheets in the Excel workbook
excel_sheets(here('data','AfricaCovid','AfricaCovid.xlsx'))
## [1] "ReadMore" "Infected_per_day" "Recovered_per_day"
## [4] "Deceased_per_day" "Cumulative_infected" "Cumulative_recovered"
## [7] "Cumulative_deceased" "SDN FLore" "GHA Flore"
## [10] "SLE Flore" "ZAF Flore"
From this list we can see that we want to import data from the sheet “Infected_per_day”.
read_xlsx(here('data','AfricaCovid','AfricaCovid.xlsx'), sheet="Infected_per_day")
## # A tibble: 53 x 492
## ISO COUNTRY_NAME AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 DZA Algeria Northern Afri… 0 0 0 0 0
## 2 AGO Angola Southern Afri… 0 0 0 0 0
## 3 BEN Benin Western Africa 0 0 0 0 0
## 4 BWA Botswana Southern Afri… 0 0 0 0 0
## 5 BFA Burkina Faso Western Africa 0 0 0 0 0
## 6 BDI Burundi Central Africa 0 0 0 0 0
## 7 CMR Cameroon Central Africa 0 0 0 0 0
## 8 CAR Central African… Central Africa 0 0 0 0 0
## 9 TCD Chad Central Africa 0 0 0 0 0
## 10 COM Comoros Eastern Africa 0 0 0 0 0
## # … with 43 more rows, and 484 more variables: 43836 <dbl>, 43837 <dbl>,
## # 43838 <dbl>, 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>,
## # 43843 <dbl>, 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>,
## # 43848 <dbl>, 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>,
## # 43853 <dbl>, 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>,
## # 43858 <dbl>, 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>,
## # 43863 <dbl>, 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>,
## # 43868 <dbl>, 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>,
## # 43873 <dbl>, 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>,
## # 43878 <dbl>, 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>,
## # 43883 <dbl>, 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>,
## # 43888 <dbl>, 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>,
## # 43893 <dbl>, 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>,
## # 43898 <dbl>, 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>,
## # 43903 <dbl>, 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>,
## # 43908 <dbl>, 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>,
## # 43913 <dbl>, 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>,
## # 43918 <dbl>, 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>,
## # 43923 <dbl>, 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>,
## # 43928 <dbl>, 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>,
## # 43933 <dbl>, 43934 <dbl>, 43935 <dbl>, …
We can see a snapshot of the data from the sheet “Infected_per_day”
Objects
Tidy data
Data types
Dates
Pivoting and grouping data
Removing duplicates
Best practice in coding
So far we have installed, loaded and used a package (readxl)
But how do we use the information generated from these actions?
We assign the information to “objects”
Section in the Epidemiologist R handbook about Objects
“Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - are objects which are assigned a name and can be referenced in later commands.”
To explain objects, we will calculate a value, assign it to an object and then use the object for a second calculation.
2+2
## [1] 4
We can assign the calculation “2+2” to an object called “a”
a <- 2+2
We can then use the object a to show the results of the calculation
a
## [1] 4
We can also use this value for further calculations such as adding 4 to the object a
a + 4
## [1] 8
b <- a+4
The result of this calculation is now stored in the object “b”
b
## [1] 8
In the previous section, we used the function read_xlsx from the package readxl to import data from an Excel spreadsheet.
But we didn’t assign this to an object, so it is not possible to use the data from the import step.
We can assign the data to an object and then conduct further analysis.
africa_covid_cases <- read_xlsx(here('data','AfricaCovid','AfricaCovid.xlsx'), sheet="Infected_per_day")
You will now see the object in the “Environment” section of R Studio.
Now that the data have been assigned to the object “africa_covid_cases”, we can start to work with the data.
In the africa_covid_cases object, there are 53 obs (observations) of 492 variables.
So what does this mean?
We can look at our data to get more information
africa_covid_cases
## # A tibble: 53 x 492
## ISO COUNTRY_NAME AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 DZA Algeria Northern Afri… 0 0 0 0 0
## 2 AGO Angola Southern Afri… 0 0 0 0 0
## 3 BEN Benin Western Africa 0 0 0 0 0
## 4 BWA Botswana Southern Afri… 0 0 0 0 0
## 5 BFA Burkina Faso Western Africa 0 0 0 0 0
## 6 BDI Burundi Central Africa 0 0 0 0 0
## 7 CMR Cameroon Central Africa 0 0 0 0 0
## 8 CAR Central African… Central Africa 0 0 0 0 0
## 9 TCD Chad Central Africa 0 0 0 0 0
## 10 COM Comoros Eastern Africa 0 0 0 0 0
## # … with 43 more rows, and 484 more variables: 43836 <dbl>, 43837 <dbl>,
## # 43838 <dbl>, 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>,
## # 43843 <dbl>, 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>,
## # 43848 <dbl>, 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>,
## # 43853 <dbl>, 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>,
## # 43858 <dbl>, 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>,
## # 43863 <dbl>, 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>,
## # 43868 <dbl>, 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>,
## # 43873 <dbl>, 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>,
## # 43878 <dbl>, 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>,
## # 43883 <dbl>, 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>,
## # 43888 <dbl>, 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>,
## # 43893 <dbl>, 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>,
## # 43898 <dbl>, 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>,
## # 43903 <dbl>, 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>,
## # 43908 <dbl>, 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>,
## # 43913 <dbl>, 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>,
## # 43918 <dbl>, 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>,
## # 43923 <dbl>, 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>,
## # 43928 <dbl>, 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>,
## # 43933 <dbl>, 43934 <dbl>, 43935 <dbl>, …
ISO - 3 letter code assigned to each country
COUNTRY_NAME - Name of the country
AFRICAN_REGION - African region
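43831, 43832, 43833 - These look like dates stored in Excel's numeric format, which by default counts days from the start of 1900 (43831 corresponds to 2020-01-01). R, by contrast, counts days since January 1, 1970.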
Show the first 5 rows of the data frame
The command head tells R that we want to see the first few rows and n= specifies how many rows we want to see.
head(africa_covid_cases, n=5)
## # A tibble: 5 x 492
## ISO COUNTRY_NAME AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 DZA Algeria Northern Africa 0 0 0 0 0
## 2 AGO Angola Southern Africa 0 0 0 0 0
## 3 BEN Benin Western Africa 0 0 0 0 0
## 4 BWA Botswana Southern Africa 0 0 0 0 0
## 5 BFA Burkina Faso Western Africa 0 0 0 0 0
## # … with 484 more variables: 43836 <dbl>, 43837 <dbl>, 43838 <dbl>,
## # 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>, 43843 <dbl>,
## # 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>, 43848 <dbl>,
## # 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>, 43853 <dbl>,
## # 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>, 43858 <dbl>,
## # 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>, 43863 <dbl>,
## # 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>, 43868 <dbl>,
## # 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>, 43873 <dbl>,
## # 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>, 43878 <dbl>,
## # 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>, 43883 <dbl>,
## # 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>, 43888 <dbl>,
## # 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>, 43893 <dbl>,
## # 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>, 43898 <dbl>,
## # 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>, 43903 <dbl>,
## # 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>, 43908 <dbl>,
## # 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>, 43913 <dbl>,
## # 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>, 43918 <dbl>,
## # 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>, 43923 <dbl>,
## # 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>, 43928 <dbl>,
## # 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>, 43933 <dbl>,
## # 43934 <dbl>, 43935 <dbl>, …
Show the last 7 rows
tail(africa_covid_cases, n=7)
## # A tibble: 7 x 492
## ISO COUNTRY_NAME AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SDN Sudan Eastern Africa 0 0 0 0 0
## 2 TZA Tanzania Eastern Africa 0 0 0 0 0
## 3 TGO Togo Western Africa 0 0 0 0 0
## 4 TUN Tunisia Northern Africa 0 0 0 0 0
## 5 UGA Uganda Eastern Africa 0 0 0 0 0
## 6 ZMB Zambia Southern Africa 0 0 0 0 0
## 7 ZWE Zimbabwe Southern Africa 0 0 0 0 0
## # … with 484 more variables: 43836 <dbl>, 43837 <dbl>, 43838 <dbl>,
## # 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>, 43843 <dbl>,
## # 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>, 43848 <dbl>,
## # 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>, 43853 <dbl>,
## # 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>, 43858 <dbl>,
## # 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>, 43863 <dbl>,
## # 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>, 43868 <dbl>,
## # 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>, 43873 <dbl>,
## # 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>, 43878 <dbl>,
## # 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>, 43883 <dbl>,
## # 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>, 43888 <dbl>,
## # 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>, 43893 <dbl>,
## # 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>, 43898 <dbl>,
## # 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>, 43903 <dbl>,
## # 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>, 43908 <dbl>,
## # 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>, 43913 <dbl>,
## # 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>, 43918 <dbl>,
## # 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>, 43923 <dbl>,
## # 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>, 43928 <dbl>,
## # 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>, 43933 <dbl>,
## # 43934 <dbl>, 43935 <dbl>, …
How many unique countries are in the data?
unique(africa_covid_cases$COUNTRY_NAME)
## [1] "Algeria" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo" "Cote d'Ivoire"
## [13] "Democratic Republic of the Congo" "Djibouti"
## [15] "Egypt" "Equatorial Guinea"
## [17] "Eritrea" "Eswatini"
## [19] "Ethiopia" "Gabon"
## [21] "Gambia" "Ghana"
## [23] "Guinea" "Guinea-Bissau"
## [25] "Kenya" "Lesotho"
## [27] "Liberia" "Libya"
## [29] "Madagascar" "Malawi"
## [31] "Mali" "Mauritania"
## [33] "Mauritius" "Mayotte"
## [35] "Morocco" "Mozambique"
## [37] "Namibia" "Niger"
## [39] "Nigeria" "Rwanda"
## [41] "Sao Tome and Principe" "Senegal"
## [43] "Sierra Leone" "Somalia"
## [45] "South Africa" "South Sudan"
## [47] "Sudan" "Tanzania"
## [49] "Togo" "Tunisia"
## [51] "Uganda" "Zambia"
## [53] "Zimbabwe"
There are 53 unique country values. This is helpful as there are also 53 rows so we can say that each row represents a country.
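Rather than counting the printed values by eye, the number of unique values can be computed directly. The sketch below uses a small made-up vector; with the course data the same pattern would be length(unique(africa_covid_cases$COUNTRY_NAME)).

```r
# Toy vector with repeated values (illustrative only)
countries <- c("Algeria", "Angola", "Benin", "Angola", "Algeria")

# unique() drops the repeats; length() counts what is left
unique_countries <- unique(countries)
n_countries <- length(unique_countries)

print(unique_countries)
print(n_countries)
```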
We can assign the list of unique countries to an object for future reference
country_list <- unique(africa_covid_cases$COUNTRY_NAME)
In the previous step, the following command was used
unique(africa_covid_cases$COUNTRY_NAME)
What does “$” do in R?
It allows us to look at a specific variable within the dataset
unique(africa_covid_cases$AFRICAN_REGION)
## [1] "Northern Africa" "Southern Africa" "Western Africa" "Central Africa"
## [5] "Eastern Africa"
And again we can assign this to an object
region_list <- unique(africa_covid_cases$AFRICAN_REGION)
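A minimal sketch of what “$” returns, using a made-up three-row data frame rather than the course data. It also shows the double-bracket form, which extracts the same column when the name is supplied as a string.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  country = c("Algeria", "Angola", "Benin"),
  region  = c("Northern Africa", "Southern Africa", "Western Africa"),
  stringsAsFactors = FALSE
)

# "$" pulls one variable out of the data frame as a vector
regions_dollar <- toy$region

# [[ ]] does the same job with the name given as a string
regions_bracket <- toy[["region"]]

print(regions_dollar)
print(identical(regions_dollar, regions_bracket))
```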
When using R, there are many approaches you can use to reach the same result.
There are thousands of packages with many functions and sometimes these packages can overlap.
This can be confusing when you are starting to learn R.
Many of the most commonly used packages are grouped into a collection called the tidyverse.
tidyverse::tidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "googledrive"
## [9] "googlesheets4" "ggplot2" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "readr"
## [21] "readxl" "reprex" "rlang" "rstudioapi"
## [25] "rvest" "stringr" "tibble" "tidyr"
## [29] "xml2" "tidyverse"
We will use functions from some of these packages over the next few sessions.
The key concept when working with packages from the tidyverse is the concept of “tidy data”.
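Epidemiologist R handbook - 4.1 “From Excel” - Tidy data
Principles of “tidy data”:
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell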
Functions from the tidyverse packages are set up to work with tidy data.
If your data are not tidy, then you will have to restructure the data to a tidy format.
Restructuring can take a lot of time if the data are stored in Excel spreadsheets with a lot of formatting/merged columns.
In a previous step, we imported COVID case data from an Excel spreadsheet.
But how do we know if the data are “tidy”?
Remember there are 3 principles
head(africa_covid_cases, n=3)
## # A tibble: 3 x 492
## ISO COUNTRY_NAME AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 DZA Algeria Northern Africa 0 0 0 0 0
## 2 AGO Angola Southern Africa 0 0 0 0 0
## 3 BEN Benin Western Africa 0 0 0 0 0
## # … with 484 more variables: 43836 <dbl>, 43837 <dbl>, 43838 <dbl>,
## # 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>, 43843 <dbl>,
## # 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>, 43848 <dbl>,
## # 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>, 43853 <dbl>,
## # 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>, 43858 <dbl>,
## # 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>, 43863 <dbl>,
## # 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>, 43868 <dbl>,
## # 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>, 43873 <dbl>,
## # 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>, 43878 <dbl>,
## # 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>, 43883 <dbl>,
## # 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>, 43888 <dbl>,
## # 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>, 43893 <dbl>,
## # 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>, 43898 <dbl>,
## # 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>, 43903 <dbl>,
## # 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>, 43908 <dbl>,
## # 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>, 43913 <dbl>,
## # 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>, 43918 <dbl>,
## # 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>, 43923 <dbl>,
## # 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>, 43928 <dbl>,
## # 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>, 43933 <dbl>,
## # 43934 <dbl>, 43935 <dbl>, …
So are the data “tidy”?
The data from the spreadsheet are not “tidy”.
The columns “43831, 43832, 43833…” represent different dates, so each row holds many observations. Therefore, the data do not meet the second principle of “tidy data” - “Each observation must have its own row”.
But we can reformat the data to make them “tidy” using functions from the packages included in the tidyverse
Remember, first we must install the packages from the tidyverse
install.packages("tidyverse")
Now that the tidyverse has been installed, we can use the functions from the packages to “tidy” the data.
One package which is very helpful for this is called tidyr.
Instead of loading individual packages, we can load the core tidyverse packages with one command
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The core packages contain powerful functions we can use to process, analyse and visualise data.
Remember, to look at the documentation for a package, type “?[name of package]”
Example -
?tidyr
To look at the functions within a package, type [name of package]::
Example
tidyr::
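Typing a package name followed by :: relies on R Studio's autocomplete to show the list. Outside R Studio, one way to get a similar list is getNamespaceExports() from base R. The sketch below uses the built-in stats package as a stand-in so that it runs on any installation; swapping in "tidyr" is an assumption that tidyr has been installed.

```r
# List the functions a package makes available.
# "stats" is used here only because it ships with R; replace with "tidyr" once installed.
exported <- getNamespaceExports("stats")

# Show the first few names in alphabetical order
print(head(sort(exported)))

# Check whether a particular function is provided by the package
print("median" %in% exported)
```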
To reformat the data to a tidy format, we need to transform the data from wide to long.
The Epidemiologist R handbook has an excellent section describing how to do this
12 - Pivoting data
africa_covid_cases_long <- africa_covid_cases %>%
pivot_longer(cols=4:492, names_to="excel_date", values_to="cases")
Transforming data from wide to long usually requires a few attempts to ensure you have achieved the correct outcome!
head(africa_covid_cases_long, n=3)
## # A tibble: 3 x 5
## ISO COUNTRY_NAME AFRICAN_REGION excel_date cases
## <chr> <chr> <chr> <chr> <dbl>
## 1 DZA Algeria Northern Africa 43831 0
## 2 DZA Algeria Northern Africa 43832 0
## 3 DZA Algeria Northern Africa 43833 0
This looks correct!
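The reshape above selects the date columns by position (4:492), which would need changing if columns were added to the spreadsheet. As a hedged alternative, the sketch below selects columns by name on a tiny made-up table: everything except the identifier columns is pivoted. The values are invented for illustration and tidyr is assumed to be installed.

```r
library(tidyr)

# Toy wide table: one row per country, one column per Excel date (values invented)
wide <- data.frame(
  ISO = c("DZA", "AGO"),
  COUNTRY_NAME = c("Algeria", "Angola"),
  `43831` = c(0, 0),
  `43832` = c(1, 0),
  check.names = FALSE   # keep the numeric-looking column names unchanged
)

# Pivot every column except the identifiers, so no positions need to be counted
long <- pivot_longer(
  wide,
  cols = -c(ISO, COUNTRY_NAME),
  names_to = "excel_date",
  values_to = "cases"
)

print(long)
```

Each country now has one row per date, which is the structure the “tidy” principles ask for.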
You can add comments to code to show other people (and remind yourself!) why you wrote the code in a particular way
africa_covid_cases_long <-
africa_covid_cases %>% #tell R to use this dataset
pivot_longer(cols = 4:492,#select the columns you want
names_to = "excel_date", #name the new date column
values_to = "cases") #name the new cases column
To add to the confusion, Excel has 2 additional date systems:
1900 date system
1904 date system
In the data set we are using, the dates are in this format:
head(africa_covid_cases_long$excel_date)
## [1] "43831" "43832" "43833" "43834" "43835" "43836"
We can use a function from another package to convert this to a standard date format.
install.packages("janitor")
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
The package janitor has many helpful functions for cleaning data
africa_covid_cases_long <- africa_covid_cases_long %>%
mutate(date_format=excel_numeric_to_date(as.numeric(excel_date)))
head(africa_covid_cases_long$date_format)
## [1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"
## [6] "2020-01-06"
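As a cross-check, the same conversion can be sketched in base R. This assumes the workbook uses Excel's default 1900 date system, for which the usual origin in R is "1899-12-30"; a workbook saved with the 1904 date system would need a different origin.

```r
# Excel serial numbers taken from the column names in the data
excel_serial <- c(43831, 43832, 43833)

# Convert to R dates, assuming Excel's 1900 date system
converted <- as.Date(excel_serial, origin = "1899-12-30")
print(converted)

# For comparison: R itself stores dates as days since 1970-01-01
r_serial <- as.numeric(as.Date("2020-01-01"))
print(r_serial)
```

The two numbers for the same day differ because Excel and R count from different starting dates.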
The new variable created “date_format” is in the format YEAR-MONTH-DAY.
We can also check if the values in the new variable look correct
min(africa_covid_cases_long$date_format) #minimum date
## [1] "2020-01-01"
max(africa_covid_cases_long$date_format) #maximum date
## [1] "2021-05-03"
We know this is a data set of COVID cases so the date range (from the start of 2020 through to May of 2021) looks to be correct.
Looking at your data
Building your analysis dataset
Answering questions with data
Missing data
Grouping and pivoting data
Filtering data
Section from Epidemiologist R handbook- 17 Descriptive Tables
There are many functions available to look at descriptive statistics from your dataset. For this example we will use a function that is included in the basic installation of R.
summary(africa_covid_cases_long)
## ISO COUNTRY_NAME AFRICAN_REGION excel_date
## Length:25917 Length:25917 Length:25917 Length:25917
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## cases date_format
## Min. : -209.0 Min. :2020-01-01
## 1st Qu.: 0.0 1st Qu.:2020-05-02
## Median : 7.0 Median :2020-09-01
## Mean : 177.6 Mean :2020-09-01
## 3rd Qu.: 78.0 3rd Qu.:2021-01-01
## Max. :21980.0 Max. :2021-05-03
## NA's :231
This function provides useful information which we can use when planning our analysis.
For example, we can see from the summary of the date variable that the first record (Min) is from 2020-01-01 and the last record (Max) is from 2021-05-03.
For the cases variable, the maximum number of cases recorded on one day was 21,980.
Before analysing the data, it is a good idea to generate a new dataset which only contains the variables you need to analyse.
So what variables do we have in africa_covid_cases_long?
names(africa_covid_cases_long)
## [1] "ISO" "COUNTRY_NAME" "AFRICAN_REGION" "excel_date"
## [5] "cases" "date_format"
We can select the variables we want to keep using the select function from the dplyr package
dplyr is a core part of the tidyverse so it is loaded when you write library(tidyverse)
analysis_dataset <- africa_covid_cases_long %>%
select(date_format,AFRICAN_REGION, COUNTRY_NAME, cases)
We can look at the first few rows of the dataset we have created to check we have selected the desired variables.
head(analysis_dataset)
## # A tibble: 6 x 4
## date_format AFRICAN_REGION COUNTRY_NAME cases
## <date> <chr> <chr> <dbl>
## 1 2020-01-01 Northern Africa Algeria 0
## 2 2020-01-02 Northern Africa Algeria 0
## 3 2020-01-03 Northern Africa Algeria 0
## 4 2020-01-04 Northern Africa Algeria 0
## 5 2020-01-05 Northern Africa Algeria 0
## 6 2020-01-06 Northern Africa Algeria 0
The select function from the dplyr package is very useful.
It can also be used to rename selected variables
analysis_dataset <- africa_covid_cases_long %>%
select(date=date_format,region=AFRICAN_REGION, country=COUNTRY_NAME, cases)
We have renamed date_format, AFRICAN_REGION and COUNTRY_NAME as date, region and country
head(analysis_dataset)
## # A tibble: 6 x 4
## date region country cases
## <date> <chr> <chr> <dbl>
## 1 2020-01-01 Northern Africa Algeria 0
## 2 2020-01-02 Northern Africa Algeria 0
## 3 2020-01-03 Northern Africa Algeria 0
## 4 2020-01-04 Northern Africa Algeria 0
## 5 2020-01-05 Northern Africa Algeria 0
## 6 2020-01-06 Northern Africa Algeria 0
The Epidemiologist R handbook has several comprehensive sections focusing on data analysis. We will continue to work with the dataset we have built while applying some of the examples from the handbook.
So far we have:
Imported the data from an Excel worksheet
Reshaped the data into a “tidy” format
Changed the format of a variable to a date
Selected only the variables we want to use for the analysis
Now we can start to use the dataset to answer questions
The dplyr package contains many useful functions for analysing data.
Some of these functions are covered in the Epidemiologist R Handbook - Section 17.4
We will use some of these functions to answer questions using our dataset.
How many confirmed cases of COVID-19 have been recorded in Africa?
analysis_dataset %>% # Tell R what dataset we want to use
summarise(total_covid_cases=sum(cases)) #Tell R what function we want to apply to the data
## # A tibble: 1 x 1
## total_covid_cases
## <dbl>
## 1 NA
The answer is “NA”, which stands for “Not Available”
This is a good example of how R deals with missing data
There may be dates in our dataset for which no case count was recorded
When data are missing, R will display “NA” for the variable
If you try to run a calculation such as a sum on data where there are one or more “NA” values, the result will be “NA”
There are several options for dealing with missing values in R
Complete case analysis
full_dataset <- na.omit(analysis_dataset)
Exclude “NA” values from calculations
analysis_dataset %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE))
## # A tibble: 1 x 1
## total_covid_cases
## <dbl>
## 1 4561465
This command has now excluded NA values and has provided us with an answer for the number of confirmed COVID-19 cases in Africa - 4,561,465
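The behaviour of “NA” values can be seen on a very small example. The vector below is made up for illustration; it shows how one missing value affects a sum, how na.rm=TRUE changes the result, and how to count the missing values before deciding what to do with them.

```r
# Toy vector of daily case counts with two missing values (invented numbers)
cases <- c(5, NA, 12, 0, NA)

# Any NA in the input makes the sum NA
total_with_na <- sum(cases)

# na.rm = TRUE removes the NA values before summing
total_without_na <- sum(cases, na.rm = TRUE)

# Count how many values are missing
n_missing <- sum(is.na(cases))

# Complete case approach: drop the missing values altogether
cases_complete <- na.omit(cases)

print(total_with_na)
print(total_without_na)
print(n_missing)
print(length(cases_complete))
```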
How many confirmed cases of COVID-19 have been recorded in Africa, by region?
analysis_dataset %>%
group_by(region) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE))
## # A tibble: 5 x 2
## region total_covid_cases
## <chr> <dbl>
## 1 Central Africa 161353
## 2 Eastern Africa 622537
## 3 Northern Africa 1371469
## 4 Southern Africa 1970137
## 5 Western Africa 435969
group_by is a very powerful function for summarising data.
The arrange function can be used to organise the results. In this case we have instructed R to sort the results by the total_covid_cases variable, from highest to lowest value.
analysis_dataset %>%
group_by(region) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
arrange(-total_covid_cases)
## # A tibble: 5 x 2
## region total_covid_cases
## <chr> <dbl>
## 1 Southern Africa 1970137
## 2 Northern Africa 1371469
## 3 Eastern Africa 622537
## 4 Western Africa 435969
## 5 Central Africa 161353
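The minus sign works for sorting numeric variables from highest to lowest. dplyr also provides desc(), which gives the same order for numbers and additionally works on text variables. A minimal sketch on made-up totals, assuming dplyr is installed:

```r
library(dplyr)

# Toy summary table (invented totals)
toy <- data.frame(
  region = c("Central", "Southern", "Northern"),
  total  = c(30, 10, 20)
)

# Two equivalent ways to sort a numeric variable from highest to lowest
by_minus <- toy %>% arrange(-total)
by_desc  <- toy %>% arrange(desc(total))

# desc() can also sort a text variable in reverse alphabetical order
by_name_desc <- toy %>% arrange(desc(region))

print(by_desc)
print(by_name_desc)
```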
We can add multiple variables to group_by
If we add region and country to the group_by command, sort from highest to lowest, we can see which countries reported the most confirmed COVID-19 cases
analysis_dataset %>%
group_by(region, country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
arrange(-total_covid_cases)
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 53 x 3
## # Groups: region [5]
## region country total_covid_cases
## <chr> <chr> <dbl>
## 1 Southern Africa South Africa 1584064
## 2 Northern Africa Morocco 511856
## 3 Northern Africa Tunisia 311743
## 4 Eastern Africa Ethiopia 258353
## 5 Northern Africa Egypt 228584
## 6 Northern Africa Libya 178335
## 7 Western Africa Nigeria 165199
## 8 Eastern Africa Kenya 160559
## 9 Northern Africa Algeria 122522
## 10 Western Africa Ghana 92683
## # … with 43 more rows
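The message printed by summarise() above is informational rather than an error: when grouping by two variables, the result stays grouped by the first one. If an ungrouped result is wanted, one option is the .groups argument. A minimal sketch on made-up data, assuming dplyr 1.0.0 or later:

```r
library(dplyr)

# Toy data: two regions, three countries (invented values)
toy <- data.frame(
  region  = c("North", "North", "South", "South"),
  country = c("A", "A", "B", "C"),
  cases   = c(1, 2, NA, 4)
)

# .groups = "drop" returns an ungrouped table and silences the message
by_country <- toy %>%
  group_by(region, country) %>%
  summarise(total = sum(cases, na.rm = TRUE), .groups = "drop")

print(by_country)
print(is_grouped_df(by_country))
```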
Another useful function is filter, which keeps only the rows that meet a condition
We can repeat the previous calculation, but then add a filter to only include results from countries in Northern Africa
analysis_dataset %>%
group_by(region, country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
arrange(-total_covid_cases) %>%
filter(region=="Northern Africa")
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: region [1]
## region country total_covid_cases
## <chr> <chr> <dbl>
## 1 Northern Africa Morocco 511856
## 2 Northern Africa Tunisia 311743
## 3 Northern Africa Egypt 228584
## 4 Northern Africa Libya 178335
## 5 Northern Africa Algeria 122522
## 6 Northern Africa Mauritania 18429
The filter can be applied at any point within the calculation. For very complex calculations, it is helpful to apply the filter as early as possible. This reduces the number of records before the complex portion of the calculation occurs.
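To illustrate that the position of the filter does not change the answer here, the sketch below runs the same summary twice on made-up data: once filtering at the end and once filtering first. It assumes dplyr is installed; the values are invented.

```r
library(dplyr)

# Toy data (invented values)
toy <- data.frame(
  region  = c("Northern Africa", "Northern Africa", "Western Africa"),
  country = c("Egypt", "Egypt", "Ghana"),
  cases   = c(1, 2, 5)
)

# Filter applied after the summary
filter_late <- toy %>%
  group_by(region, country) %>%
  summarise(total = sum(cases, na.rm = TRUE), .groups = "drop") %>%
  filter(region == "Northern Africa")

# Filter applied first, so fewer rows reach the summary step
filter_early <- toy %>%
  filter(region == "Northern Africa") %>%
  group_by(region, country) %>%
  summarise(total = sum(cases, na.rm = TRUE), .groups = "drop")

# Both approaches give the same table
print(all.equal(as.data.frame(filter_early), as.data.frame(filter_late)))
```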
filter can also be used to create a new data frame containing only the matching rows
northern_africa <- analysis_dataset %>%
filter(region=="Northern Africa")
Using filters, we can answer additional questions.
What percentage of Northern Africa’s confirmed COVID-19 cases were recorded in each country in Northern Africa?
#to convert the calculation to percentage we will need to install an additional package
#install.packages("scales")
northern_africa %>%
group_by(country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
mutate(percentage=scales::percent(total_covid_cases/sum(total_covid_cases))) %>%
#with this mutate command we are telling r to divide the total number of covid cases for each country by the total number of covid cases for all countries in northern africa
arrange(-total_covid_cases)
## # A tibble: 6 x 3
## country total_covid_cases percentage
## <chr> <dbl> <chr>
## 1 Morocco 511856 37.3%
## 2 Tunisia 311743 22.7%
## 3 Egypt 228584 16.7%
## 4 Libya 178335 13.0%
## 5 Algeria 122522 8.9%
## 6 Mauritania 18429 1.3%
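The arithmetic behind the percentage column can be reproduced in base R using the country totals shown in the output above. This is a sketch of the calculation only; sprintf() is used here for formatting instead of scales::percent().

```r
# Country totals copied from the output above
total_covid_cases <- c(Morocco = 511856, Tunisia = 311743, Egypt = 228584,
                       Libya = 178335, Algeria = 122522, Mauritania = 18429)

# Each country's share of the regional total
share <- total_covid_cases / sum(total_covid_cases)

# Format as a percentage with one decimal place
percentage <- sprintf("%.1f%%", 100 * share)

print(sum(total_covid_cases))
print(percentage)
```

The regional total matches the Northern Africa figure from the earlier summary by region, which is a useful consistency check.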
We can store the result as a data frame by assigning the calculation to an object.
northern_africa_cases_country <- northern_africa %>%
group_by(country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
mutate(percentage=scales::percent(total_covid_cases/sum(total_covid_cases))) %>%
#with this mutate command we are telling r to divide the total number of covid cases for each country by the total number of covid cases for all countries in northern africa
arrange(-total_covid_cases)
When was the first confirmed case of COVID-19 in Northern Africa?
northern_africa %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE))
## # A tibble: 1 x 4
## date region country cases
## <date> <chr> <chr> <dbl>
## 1 2020-02-14 Northern Africa Egypt 1
Here we have added 2 filters:
Only keep records where the value for cases is higher than 0
Only keep records where the value for date is equal to the minimum value for date. We have also added the na.rm=TRUE argument from a previous step, which tells min to ignore missing values. If you don’t know the data very well, it is good practice to add this argument.
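The order of the two filters matters. The sketch below, using a made-up three-day table (`toy` is hypothetical), shows what can happen if both conditions are placed in a single filter call instead: the minimum date is then calculated across all rows, including days with 0 cases.

```r
library(dplyr)

# Hypothetical toy data: the first case appears on the third day
toy <- tibble(
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  cases = c(0, 0, 4)
)

# Two steps: min(date) is calculated only from rows where cases > 0
two_steps <- toy %>%
  filter(cases > 0) %>%
  filter(date == min(date, na.rm = TRUE))

# One call: both conditions are checked against the full table,
# so min(date) is 2020-01-01, a day with 0 cases, and no row passes both
one_call <- toy %>%
  filter(cases > 0, date == min(date, na.rm = TRUE))

nrow(two_steps)  # 1
nrow(one_call)   # 0
```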
When was the first confirmed case of COVID-19 in Northern Africa, by country?
northern_africa %>%
group_by(country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE))
## # A tibble: 6 x 4
## # Groups: country [6]
## date region country cases
## <date> <chr> <chr> <dbl>
## 1 2020-02-25 Northern Africa Algeria 1
## 2 2020-02-14 Northern Africa Egypt 1
## 3 2020-03-24 Northern Africa Libya 1
## 4 2020-03-13 Northern Africa Mauritania 1
## 5 2020-03-02 Northern Africa Morocco 1
## 6 2020-03-02 Northern Africa Tunisia 1
The filter function can also be used to exclude certain records from the analysis
northern_africa %>%
group_by(country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE)) %>%
#filter(!country=="Tunisia") %>%
filter(country!="Tunisia") #both methods for excluding results (in this case excluding results where the value for country is Tunisia) can be used
## # A tibble: 5 x 4
## # Groups: country [5]
## date region country cases
## <date> <chr> <chr> <dbl>
## 1 2020-02-25 Northern Africa Algeria 1
## 2 2020-02-14 Northern Africa Egypt 1
## 3 2020-03-24 Northern Africa Libya 1
## 4 2020-03-13 Northern Africa Mauritania 1
## 5 2020-03-02 Northern Africa Morocco 1
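The two exclusion styles above can be compared side by side on a small made-up table (`toy_countries` is hypothetical). The third variant, using %in%, is one possible way to exclude several values at once and is not part of the original example.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_countries <- tibble(country = c("Algeria", "Egypt", "Tunisia", "Libya"))

# Style 1: negate the whole comparison
exclude_not <- toy_countries %>% filter(!country == "Tunisia")

# Style 2: use the "not equal to" operator
exclude_neq <- toy_countries %>% filter(country != "Tunisia")

# Excluding more than one value: negate a %in% test
exclude_many <- toy_countries %>% filter(!country %in% c("Tunisia", "Libya"))

identical(exclude_not$country, exclude_neq$country)  # TRUE
exclude_many$country                                 # "Algeria" "Egypt"
```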
These results can be stored in an object for future use
first_cases_northern_africa <- northern_africa %>%
group_by(country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE))
On what date was the 100th case of COVID-19 reported in each country in Northern Africa?
northern_africa %>%
group_by(country) %>%
mutate(cumulative_cases=cumsum(cases)) %>%
filter(cumulative_cases>=100) %>%
slice(1) %>%
pull(date, country)
## Algeria Egypt Libya Mauritania Morocco Tunisia
## "2020-03-20" "2020-03-15" "2020-05-28" "2020-05-19" "2020-03-22" "2020-03-24"
Here we have introduced two new functions: slice and pull.
slice selects specific rows from a dataset by position. In this case, we added a column with the cumulative number of cases, filtered the dataset to only keep rows where that value is greater than or equal to 100, and then used slice(1) to keep the first remaining row for each country. This relies on the rows for each country being in date order.
pull extracts a single column from the result as a vector. Here, pull(date, country) returns the dates and uses the country names to label them.
first_100cases <- northern_africa %>%
group_by(country) %>%
mutate(cumulative_cases=cumsum(cases)) %>%
filter(cumulative_cases>=100) %>%
slice(1) %>%
pull(date, country)
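One thing to watch for with cumsum is missing values: once it meets an NA, every later cumulative value is also NA, and filter then drops those rows. The sketch below uses made-up data (`toy_daily` is hypothetical) and treats missing days as 0 with coalesce. That is an assumption made for illustration and may not be appropriate for every dataset. It also sorts by date first, since slice(1) depends on row order.

```r
library(dplyr)

# Hypothetical toy data: country A has one day with a missing value
toy_daily <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  date    = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                      "2020-03-01", "2020-03-02", "2020-03-03")),
  cases   = c(60, NA, 50, 100, 5, 5)
)

first_100_toy <- toy_daily %>%
  arrange(country, date) %>%                               # make sure rows are in date order
  group_by(country) %>%
  mutate(cumulative_cases = cumsum(coalesce(cases, 0))) %>% # assumption: count missing days as 0
  filter(cumulative_cases >= 100) %>%
  slice(1) %>%                                             # first row per country at or above 100
  pull(date, country)                                      # dates, labelled by country

first_100_toy
# Without coalesce, cumsum(c(60, NA, 50)) would be 60, NA, NA
# and country A would be dropped from the result
```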
The dataset is currently set up so that each row contains information on the number of recorded COVID-19 cases for a specific date for a specified country. One calculation which we may be interested in is the overall trend of case numbers over a period of time. For this, we can calculate cumulative values and averages to identify any trends in the data.
To demonstrate this, we will filter the dataset to only include 1 country - in this case, Morocco.
morocco_covid_cases <- northern_africa %>%
filter(country=="Morocco")
When data are collected on a daily basis, it can be helpful to apply functions to improve the interpretation of trends which may be present in the data. For example, with this COVID dataset, data are available for 489 days between January 1, 2020 and May 3, 2021. There will be some days when 0 cases are reported and there will be some days when many more cases are reported. Some of these differences may be due to delays in reporting cases if, for example, reporting does not take place at the weekend.
There are a number of functions in the zoo package which can help us to partially account for reporting delays.
pacman::p_load(zoo)
morocco_covid_cases_mean <- morocco_covid_cases %>%
mutate(cases_7day_mean=rollmean(cases,k=7, fill=NA))
We have now created a new variable which calculates the 7-day moving average of cases. In the visualisation session of this training, we will compare the graphs of cases to the seven-day moving average to show the difference between the two indicators.
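To see what rollmean is doing, it can help to run it on a very short vector. The sketch below uses made-up numbers (`daily_cases` is hypothetical) and a window of 3 instead of 7 so the result is easy to check by hand. By default the window is centred on each day, which is why fill=NA adds missing values at both ends. The align argument shown in the second call is an optional variation, not something used elsewhere in this training.

```r
library(zoo)

# Hypothetical daily counts for illustration only
daily_cases <- c(1, 2, 3, 4, 5, 6, 7)

# Centred window (the default): each value is the mean of the day before, the day, and the day after
centred <- rollmean(daily_cases, k = 3, fill = NA)
centred   # NA 2 3 4 5 6 NA

# Trailing window: each value is the mean of the day and the two days before it
trailing <- rollmean(daily_cases, k = 3, fill = NA, align = "right")
trailing  # NA NA 2 3 4 5 6
```

With k=7 and the default centred window, the first 3 and last 3 days of the series have no moving average for the same reason.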
If you wanted to calculate a moving average over a longer time period, you can adjust the number after k=
morocco_covid_cases_mean <- morocco_covid_cases %>%
mutate(cases_7day_mean=rollmean(cases,k=7, fill=NA)) %>%
mutate(cases_14day_mean=rollmean(cases,k=14,fill=NA))
We can also use functions from another package to display the information in a more user-friendly table.
The gt package provides a very flexible interface for building tables from your data.
pacman::p_load(gt)
The documentation describing the functions can be found here.
Below is an example using the dataset we have built.
northern_africa_cases_country_table <- northern_africa_cases_country %>%
gt() %>%
tab_header(
title = md("COVID-19 in Northern Africa")
) %>%
cols_label(
country = "Country",
total_covid_cases = "N",
percentage = "% of total cases in Northern Africa"
) %>%
tab_spanner(
label = "Confirmed cases",
columns = c(total_covid_cases,percentage)
) %>%
fmt_number(
columns = total_covid_cases,
decimals=0,
use_seps = TRUE
) %>%
cols_align(
align = "center",
columns = c(total_covid_cases, percentage)
)
northern_africa_cases_country_table
COVID-19 in Northern Africa
| Country | Confirmed cases: N | Confirmed cases: % of total cases in Northern Africa |
|---|---|---|
| Morocco | 511,856 | 37.3% |
| Tunisia | 311,743 | 22.7% |
| Egypt | 228,584 | 16.7% |
| Libya | 178,335 | 13.0% |
| Algeria | 122,522 | 8.9% |
| Mauritania | 18,429 | 1.3% |
gt has many options for customising tables. To demonstrate this, we will build a table to show when each country in Africa recorded its first COVID-19 case. This example uses some of the techniques demonstrated in this article.
first_cases_africa <- africa_covid_cases_long %>%
select(date=date_format,region=AFRICAN_REGION, country=COUNTRY_NAME, cases) %>%
group_by(region,country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE)) %>%
ungroup()
first_cases_africa_table <- first_cases_africa %>%
select(region,country,date) %>%
group_by(region) %>%
arrange(date) %>%
gt() %>%
tab_header(
title = md("When did countries in Africa record their first case of COVID-19?")
) %>%
fmt_date(
columns = date,
date_style = 4
) %>%
opt_all_caps() %>%
#Use the Chivo font
#Note the great 'google_font' function in 'gt' that removes the need to pre-load fonts
opt_table_font(
font = list(
google_font("Chivo"),
default_fonts()
)
) %>%
cols_label(
country = "Country",
date = "Date"
) %>%
cols_align(
align = "center",
columns = c(country, date)
) %>%
tab_options(
column_labels.border.top.width = px(3),
column_labels.border.top.color = "transparent",
table.border.top.color = "transparent",
table.border.bottom.color = "transparent",
data_row.padding = px(3),
source_notes.font.size = 12,
heading.align = "left",
#Adjust grouped rows to make them stand out
row_group.background.color = "grey") %>%
tab_source_note(source_note = "Data: Compiled from national governments and WHO by Humanitarian Emergency Response Africa (HERA)")
first_cases_africa_table
| When did countries in Africa record their first case of COVID-19? | |
|---|---|
| Country | Date |
| Northern Africa | |
| Egypt | Friday 14 February 2020 |
| Algeria | Tuesday 25 February 2020 |
| Morocco | Monday 2 March 2020 |
| Tunisia | Monday 2 March 2020 |
| Mauritania | Friday 13 March 2020 |
| Libya | Tuesday 24 March 2020 |
| Western Africa | |
| Nigeria | Thursday 27 February 2020 |
| Senegal | Friday 28 February 2020 |
| Togo | Friday 6 March 2020 |
| Burkina Faso | Monday 9 March 2020 |
| Cote d'Ivoire | Wednesday 11 March 2020 |
| Ghana | Thursday 12 March 2020 |
| Guinea | Thursday 12 March 2020 |
| Benin | Monday 16 March 2020 |
| Gambia | Monday 16 March 2020 |
| Liberia | Monday 16 March 2020 |
| Niger | Thursday 19 March 2020 |
| Guinea-Bissau | Wednesday 25 March 2020 |
| Mali | Wednesday 25 March 2020 |
| Sierra Leone | Tuesday 31 March 2020 |
| Southern Africa | |
| South Africa | Thursday 5 March 2020 |
| Eswatini | Saturday 14 March 2020 |
| Namibia | Saturday 14 March 2020 |
| Zimbabwe | Friday 20 March 2020 |
| Angola | Saturday 21 March 2020 |
| Zambia | Sunday 22 March 2020 |
| Mozambique | Monday 23 March 2020 |
| Botswana | Monday 30 March 2020 |
| Malawi | Thursday 2 April 2020 |
| Lesotho | Tuesday 12 May 2020 |
| Central Africa | |
| Cameroon | Friday 6 March 2020 |
| Democratic Republic of the Congo | Tuesday 10 March 2020 |
| Gabon | Thursday 12 March 2020 |
| Central African Republic | Saturday 14 March 2020 |
| Equatorial Guinea | Saturday 14 March 2020 |
| Chad | Thursday 19 March 2020 |
| Congo | Sunday 22 March 2020 |
| Burundi | Tuesday 31 March 2020 |
| Sao Tome and Principe | Monday 6 April 2020 |
| Eastern Africa | |
| Ethiopia | Friday 13 March 2020 |
| Kenya | Friday 13 March 2020 |
| Sudan | Friday 13 March 2020 |
| Rwanda | Saturday 14 March 2020 |
| Somalia | Monday 16 March 2020 |
| Mayotte | Tuesday 17 March 2020 |
| Tanzania | Tuesday 17 March 2020 |
| Djibouti | Wednesday 18 March 2020 |
| Mauritius | Wednesday 18 March 2020 |
| Madagascar | Friday 20 March 2020 |
| Eritrea | Saturday 21 March 2020 |
| Uganda | Saturday 21 March 2020 |
| South Sudan | Monday 6 April 2020 |
| Comoros | Thursday 30 April 2020 |
| Data: Compiled from national governments and WHO by Humanitarian Emergency Response Africa (HERA) | |
One of the key strengths of R is visualising data. There are many packages which have functions you can use to make graphs, tables, maps…the list is endless!
The first package of functions we will use for visualising data is another core tidyverse package called ggplot2. This is commonly referred to as ggplot
We have already loaded the package when we ran library(tidyverse)
You can also choose to only load the ggplot2 package by typing library(ggplot2)
library(ggplot2)
The Epidemiologist R handbook has 2 sections focused on ggplot
These sections contain very helpful explanations of many of the functions available with ggplot. There are also a number of excellent references for every type of graph you want to make.
We will walk through some examples to teach the most common approaches
Firstly, we will produce epicurves to describe the distribution of COVID-19 cases (y axis) over time (x axis).
Make a graph of confirmed COVID-19 cases in Northern Africa
ggplot(northern_africa, aes(x=date,y=cases)) +
geom_line()
## Warning: Removed 7 row(s) containing missing values (geom_path).
This command has generated a line graph of confirmed COVID-19 cases for countries in Northern Africa.
From earlier steps, we know that the dataset northern_africa contains data from multiple countries. We can list them with unique(northern_africa$country)
We can add more information to the ggplot command to draw separate lines for each country
ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
geom_line()
## Warning: Removed 9 row(s) containing missing values (geom_path).
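The warnings above appear because some rows have a missing value for cases, which ggplot cannot draw. One possible way to see where the missing values are is to count them by country before plotting. The sketch below uses made-up data (`toy_plot_data` is hypothetical).

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy_plot_data <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  cases   = c(1, NA, 3, NA, NA, 2)
)

# Count the rows with a missing value for cases in each country
missing_by_country <- toy_plot_data %>%
  group_by(country) %>%
  summarise(missing_days = sum(is.na(cases)))

missing_by_country
```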
To make the graph more presentable, we can add more options to the ggplot command
ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
geom_line() +
labs(x='Date', y='Total cases', color='Country') + #label axes
theme(legend.position='top') + #place legend at top of graph
scale_x_date(date_breaks = '2 months', #set x axis to have 2 month breaks
date_minor_breaks = '1 month', #set x axis to have 1 month breaks
date_labels = '%d-%m-%y') #change label for x axis
## Warning: Removed 9 row(s) containing missing values (geom_path).
More information on plotting time-series data using ggplot can be found here.
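The date_labels option uses the same format codes as the base R format function, so you can try a code on a single date before adding it to a graph. A short sketch with an arbitrary example date:

```r
# An arbitrary date used only to preview the label formats
example_date <- as.Date("2020-02-14")

label_short <- format(example_date, "%d-%m-%y")  # day-month-2 digit year
label_month <- format(example_date, "%m-%Y")     # month-4 digit year

label_short  # "14-02-20"
label_month  # "02-2020"
```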
It is still difficult to see the data for each country. There is a helpful command called facet_wrap to fix this and allow us to show multiple epicurves by country.
ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
geom_col() +
labs(x='Date', y='Total cases') + #label axes
theme(legend.position='none') + #remove legend by setting position to 'none'
scale_x_date(date_breaks = '4 months', #set x axis to have 4 month breaks
date_minor_breaks = '2 months', #set x axis to have 2 month minor breaks
date_labels = '%m-%Y') + #change label for x axis
facet_wrap(~country) # this will create a separate graph for each country
## Warning: Removed 17 rows containing missing values (position_stack).
In a previous section, we added 7-day and 14-day rolling averages of cases. These indicators can be helpful for identifying trends over time.
morocco_covid_cases_graph <- morocco_covid_cases_mean %>%
ggplot() +
geom_col(aes(x=date, y=cases, color=country)) +
geom_line(aes(x=date, y=cases_7day_mean)) +
labs(x='Month-Year',
y='Total cases',
title='Cases and 7-day average (black line)') +
theme(legend.position='none') +
scale_x_date(date_breaks = '2 months',
date_minor_breaks = '1 month',
date_labels = '%m-%y')
morocco_covid_cases_graph
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 7 row(s) containing missing values (geom_path).
This chart shows the number of COVID-19 cases reported each day in Morocco between January 1, 2020 and May 3, 2021. The red bars show the reported case numbers for each day, while the black line shows the 7-day average of cases. We can see that there are several dates with substantially higher numbers of cases compared to the neighbouring dates. This could be due to increased testing on specific days, but it is more likely due to delays in reporting leading to a backlog of cases reported on specific days. The black line “smooths” out these differences, allowing us to see the overall trend.
There are many R packages available for applying methods from Geographic Information Systems (GIS) to your data.
The Epidemiologist R handbook has an extensive section - 28: GIS basics
Some of the examples touch on more advanced Spatial Epidemiology techniques which can be used to derive insights from your data frame. For the purpose of this training, we will focus on a sub-section of the GIS basics: 28.8 Mapping with ggplot2
There are several key terms which are defined in the Epidemiologist R handbook - 28.2 Key terms
Understanding what these terms mean will help you to understand the main concepts in this section. For example, we will be using a shapefile to tell R the location where we want to build a map. As per section 28.2, a shapefile can be defined as
“a common data format for storing “vector” spatial data consisting of lines, points, or polygons. A single shapefile is actually a collection of at least three files - .shp, .shx, and .dbf. All of these sub-component files must be present in a given directory (folder) for the shapefile to be readable. These associated files can be compressed into a ZIP folder to be sent via email or download from a website.”
Shapefiles are often made available free of charge by official government bodies or international humanitarian organisations. For this example, we will be using a shapefile of countries in Africa provided by Africa CDC.
Spatial data can be complex, particularly if the files cover a large area with many complex borders. To simplify the use of spatial data, several packages have been developed to apply the “tidy” data approach to shapefiles. One example is a package called sf.
Before using sf to read the data and assign it to an object in R, you will need to unzip the shapefile that you have downloaded. When you unzip this file you will see several files with different endings e.g. “.dbf”, “.prj”, “.shp”.
The file we want to import ends in “.shp” but we also want to keep the other files in the folder. The other files contain useful information about the shapefile which GIS programs can use to correctly import the shapefile.
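If you want to confirm that the required files are all in the same folder before importing, a quick check in base R is one option. The sketch below creates empty placeholder files in a temporary folder so that it can run anywhere. The folder and file names (`country_boundaries_example`, `Example.shp`) are hypothetical and should be replaced with your own.

```r
# Hypothetical folder with empty placeholder files, for illustration only
shp_dir <- file.path(tempdir(), "country_boundaries_example")
dir.create(shp_dir, showWarnings = FALSE)
file.create(file.path(shp_dir, c("Example.shp", "Example.shx", "Example.dbf")))

# Path to the .shp file you plan to import
shp_path <- file.path(shp_dir, "Example.shp")

# Build the names of the three required files and check that each one exists
required <- paste0(tools::file_path_sans_ext(shp_path), c(".shp", ".shx", ".dbf"))
components_present <- file.exists(required)
names(components_present) <- basename(required)

components_present       # one TRUE/FALSE per required file
all(components_present)  # TRUE when nothing is missing
```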
pacman::p_load(sf)
africa_shp <- read_sf(here('data', 'country_boundaries', 'Country_Boundaries.shp'))
The shapefile has now been loaded and assigned to an object called africa_shp. We can look at this object to understand more about the shapefile
africa_shp
## Simple feature collection with 55 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -2822922 ymin: -4141244 xmax: 7069015 ymax: 4517454
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 55 x 2
## COUNTRY_NA geometry
## <chr> <MULTIPOLYGON [m]>
## 1 Namibia (((1687094 -3075056, 1687125 -3075073, 1687147 -3075115, 16871…
## 2 South Africa (((2161609 -4121371, 2161614 -4121429, 2161630 -4121442, 21616…
## 3 Botswana (((2801698 -2011788, 2802243 -2011806, 2802414 -2011767, 28025…
## 4 Angola (((1304707 -1864056, 1304705 -1864343, 1304800 -1864443, 13048…
## 5 Guinea Bissau (((-1692512 1217444, -1693531 1216231, -1693791 1216278, -1693…
## 6 Liberia (((-1085740 954849.8, -1084902 953191.7, -1083728 951943.4, -1…
## 7 Sierra Leone (((-1279803 773261.3, -1280291 773183.9, -1283912 775130.7, -1…
## 8 Guinea (((-1483303 1025831, -1483346 1025844, -1483366 1025887, -1483…
## 9 Cote d'Ivoire (((-692386.1 1201332, -692088.9 1201295, -691820.6 1201351, -6…
## 10 Burkina Faso (((-50724.95 1698516, -49208.78 1697051, -46751.96 1693717, -4…
## # … with 45 more rows
# names(africa_shp) #to look at the names of the columns in the data frame
# head(africa_shp) #to look at the first few rows of data
There are two variables in the dataset - COUNTRY_NA and geometry. The geometry variable is the most important variable when working with sf objects. It contains all of the geographical coordinate information which lets R know how to plot the map.
As we have previously used ggplot2 to visualise data, we will continue to use that package for making simple maps. There are many other packages which can be used to make maps in R: leaflet, tmap, mapbox.
For this example, we want to map the total number of confirmed COVID-19 cases in North Africa. The shapefile we have loaded contains geographical coordinate information for all countries in Africa. We can use methods from previous sections to filter the shapefile to only include countries in Northern Africa.
#check which countries are in the northern_africa_cases_country data frame
northern_african_countries <- unique(northern_africa_cases_country$country)
northern_african_countries
## [1] "Morocco" "Tunisia" "Egypt" "Libya" "Algeria"
## [6] "Mauritania"
#n=6
north_africa_shp <- africa_shp %>%
sf::st_simplify() %>%
rename(country=COUNTRY_NA) %>% #rename COUNTRY_NA to make it easier to follow the example
filter(country %in% northern_african_countries) #filter to only include countries listed in the northern_african_countries vector
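Filtering with %in% only keeps countries whose names are spelled exactly the same in both datasets. For example, the case data shown earlier uses "Guinea-Bissau" while the shapefile uses "Guinea Bissau". A quick check with setdiff can flag names that will not match. The vectors below are short made-up examples, not the full datasets.

```r
# Hypothetical short lists of names for illustration only
case_countries      <- c("Morocco", "Guinea-Bissau", "Egypt")
shapefile_countries <- c("Morocco", "Guinea Bissau", "Egypt", "Namibia")

# Names in the case data with no exact match in the shapefile
unmatched <- setdiff(case_countries, shapefile_countries)
unmatched  # "Guinea-Bissau"
```

If this returns any names, they will need to be recoded in one of the datasets before filtering or joining.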